The current bridge infrastructure relies on a central bridge authority to collect, distribute, and publish bridge relay descriptors. We believe the current infrastructure can handle up to 10,000 bridges.
The likely scaling points are the descriptor database, the metrics portal's ability to process this many descriptors for analysis, and the reachability-testing code in the bridge authority. We should investigate how these components would handle more than 10,000 bridge descriptors.
The last sentence is the key here. We don't need to build and deploy a scalable bridge infrastructure. We need to write down thoughts and notes about how to scale the bridge db system to 100,000 or 1,000,000 bridges.
Karsten has previously said that we'll want to look very carefully at the bridge authority, BridgeDB, metrics, and maybe others. He's mostly worried about the bridge authority, but BridgeDB and metrics will have to be extended as well.
I started this analysis by writing a small tool to generate sample data for BridgeDB and metrics-db. This tool takes the contents of one of Tonga's bridge tarballs as input, copies them a given number of times, and overwrites the first two bytes of relay fingerprints in every copy with 0000, 0001, etc. The tool also fixes references between network statuses, server descriptors, and extra-info descriptors. This is sufficient to trick BridgeDB and metrics-db into thinking that relays in the copies are distinct relays. I used the tool to generate tarballs with 2, 4, 8, 16, 32, and 64 times as many bridge descriptors in them.
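For the record, here is a rough Python sketch of the fingerprint-overwriting idea. It is illustrative only, not the actual tool: file layout, names, and the regex are assumptions, and it only rewrites contiguous 40-hex-character fingerprints, while the real tool also has to handle the space-grouped fingerprint lines, base64-encoded identities in network statuses, and descriptor digests.

```python
# Rough sketch of the descriptor multiplier; names, layout, and regex are
# illustrative. It only rewrites contiguous hex fingerprints and ignores the
# space-grouped fingerprint lines, base64 identities, and digest references
# that the real tool also has to fix.
import io
import re
import tarfile

FPR_RE = re.compile(rb'\b[0-9A-F]{40}\b')  # 40 hex chars = 20-byte fingerprint

def multiply(in_tarball, out_tarball, copies):
    """Emit `copies` variants of every descriptor, overwriting the first two
    fingerprint bytes with the copy index (0000, 0001, ...)."""
    with tarfile.open(in_tarball) as src, \
         tarfile.open(out_tarball, 'w:gz') as dst:
        for member in src.getmembers():
            if not member.isfile():
                continue
            data = src.extractfile(member).read()
            for i in range(copies):
                prefix = ('%04X' % i).encode()
                # Rewriting every matching fingerprint consistently keeps the
                # rewritten descriptors within one copy referring to the same
                # (fake) relays.
                variant = FPR_RE.sub(lambda m: prefix + m.group(0)[4:], data)
                info = tarfile.TarInfo('%04X/%s' % (i, member.name))
                info.size = len(variant)
                dst.addfile(info, io.BytesIO(variant))
```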
In the next step I fed the tarballs into BridgeDB and metrics-db. BridgeDB reads the network statuses and server descriptors from the latest tarball and writes them to a local database. metrics-db sanitizes the two half-hourly tarballs every hour, establishes an internal mapping between descriptors, and writes sanitized descriptors with fixed references to disk.
The attached graph shows the results.
The upper graph shows how the tarballs grow in size with more bridge descriptors in them. This growth is, unsurprisingly, linear. One thing to keep in mind here is that bandwidth and storage requirements for the hosts transferring and storing bridge tarballs grow with the tarballs. We'll want to pay extra attention to disk space running out on those hosts.
The middle graph shows how long BridgeDB takes to load descriptors from a tarball. This graph is linear, too, which indicates that BridgeDB can handle an increase in the number of bridges pretty well. One thing I couldn't check is whether BridgeDB's ability to serve client requests is in any way affected during the descriptor import. I assume it'll be fine. Aaron, are there other things in BridgeDB that I overlooked that may not scale?
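One crude way to check the request-serving question would be to poll a BridgeDB test instance while an import is running and watch for latency spikes or errors. A hypothetical sketch; the URL and interval are placeholders, not a real deployment:

```python
# Hypothetical probe: hit a BridgeDB test instance once per second while a
# descriptor import is running and log response latencies. URL is a placeholder.
import time
import urllib.request

URL = 'https://bridges.example.net/'  # a test instance, not production
INTERVAL = 1.0                        # seconds between probes

while True:
    start = time.time()
    try:
        urllib.request.urlopen(URL, timeout=10).read()
        status = 'ok'
    except Exception as exc:
        status = 'error: %s' % exc
    print('%.3f %s (%.2fs)' % (start, status, time.time() - start))
    time.sleep(INTERVAL)
```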
The lower graph shows how well metrics-db copes with more bridges. The growth is slightly worse than linear. In any case, the absolute time required to handle 25K bridges is worrisome (I didn't try 50K). metrics-db runs in an hourly cronjob, and if that cronjob doesn't finish within 1 hour, we cannot start the next run and will be missing some data. We might have to sanitize bridge descriptors in a different thread or process than the one that fetches all the other metrics data. I can also look into other Java libraries for handling .gz-compressed files that are faster than the one we're using. So, we can probably handle 25K bridges somehow, and maybe even 50K. Somehow.
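One way to keep the hourly cronjob from being blocked would be to run the bridge sanitizing from its own cron entry with a simple lock, so a slow run skips a slot instead of delaying everything else. A hypothetical wrapper sketch; the paths and command line are placeholders, not the real deployment:

```python
#!/usr/bin/env python3
# Hypothetical wrapper: run the bridge sanitizer in its own process so the
# rest of the hourly metrics-db run is never blocked by it. Paths and the
# command line are placeholders.
import os
import subprocess
import sys

LOCK = '/srv/metrics/bridge-sanitizer.lock'

def main():
    try:
        # O_EXCL makes lock creation atomic; if the lock exists, a previous
        # run is still going, so skip this slot instead of piling up.
        fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        sys.exit(0)
    try:
        subprocess.check_call(['java', '-jar', 'bridge-sanitizer.jar'])
    finally:
        os.remove(LOCK)

if __name__ == '__main__':
    main()
```

Whether a separate process is the right split depends on how the rest of metrics-db is wired up; it's just one way to keep the hourly slot free for the other data sources.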
Finally, note that I left out the most important part of this analysis: can Tonga, or more generally, a single bridge authority handle this increase in bridges? I'm not sure how to test such a setting, at least not without running 50K bridges in a private network. I could imagine this requires some more sophisticated sample data generation, including getting the crypto right, and then talking to Tonga's DirPort. If there's an easy way to test this, I'll do it. If not, we can always hope for the best. What can go wrong.
If we end up with way too many bridges, here are a few things we'll want to look at updating:
- Tonga still does a reachability test on each bridge every 21 minutes or so. Eventually the number of TLS handshakes it's doing will overwhelm its CPU (see the back-of-envelope sketch after this list).
- The tarballs we make every half hour have substantial overlap. If we have tens of thousands of descriptors, we'll want to get smarter about sending diffs over to BridgeDB.
- Somebody should check whether BridgeDB's interaction with users freezes while it's reading a new set of data.
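To put the first point in numbers, the authority's outgoing handshake rate grows linearly with the bridge count. A quick back-of-envelope, pure arithmetic with no Tor specifics:

```python
# Back-of-envelope: TLS handshakes per second the bridge authority initiates
# if it tests every bridge once every 21 minutes.
TEST_INTERVAL = 21 * 60  # seconds

for bridges in (10_000, 25_000, 50_000, 100_000, 1_000_000):
    rate = bridges / TEST_INTERVAL
    print('%9d bridges -> %7.1f handshakes/s' % (bridges, rate))
```

At 100,000 bridges that's roughly 80 outgoing TLS handshakes per second, sustained, which is where the CPU concern comes from.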